1 Introduction

Ever heard of Mobike, ofo and oBike before? All three of them are bike rental services which employ the use of smartphone applications to create station-less and cashless bike rentals possible to users. By using such platforms, bike rentals can be automated and tracked.

Objective:
To predict the total number of bike rentals each day based on measurable variables such as weather conditions and date-time.

2 Data Preparation

Data is first cleaned by removing columns that are not useful, converting variables to appropriate type and checked for missing values. Especially for target variable, count, the whole row of data has to be removed if there are any missing values as the target variable should not be manipulated.

## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
## Skim summary statistics
##  n obs: 731 
##  n variables: 16 
## 
## -- Variable type:factor ---------------------------------------------------------------------------------------------------
##  variable missing complete   n n_unique                     top_counts
##    dteday       0      731 731      731 201: 1, 201: 1, 201: 1, 201: 1
##  ordered
##    FALSE
## 
## -- Variable type:integer --------------------------------------------------------------------------------------------------
##    variable missing complete   n     mean      sd p0    p25  p50    p75
##      casual       0      731 731  848.18   686.62  2  315.5  713 1096  
##         cnt       0      731 731 4504.35  1937.21 22 3152   4548 5956  
##     holiday       0      731 731    0.029    0.17  0    0      0    0  
##     instant       0      731 731  366      211.17  1  183.5  366  548.5
##        mnth       0      731 731    6.52     3.45  1    4      7   10  
##  registered       0      731 731 3656.17  1560.26 20 2497   3662 4776.5
##      season       0      731 731    2.5      1.11  1    2      3    3  
##  weathersit       0      731 731    1.4      0.54  1    1      1    2  
##     weekday       0      731 731    3        2     0    1      3    5  
##  workingday       0      731 731    0.68     0.47  0    0      1    1  
##          yr       0      731 731    0.5      0.5   0    0      1    1  
##  p100     hist
##  3410 <U+2587><U+2587><U+2585><U+2582><U+2581><U+2581><U+2581><U+2581>
##  8714 <U+2582><U+2585><U+2585><U+2587><U+2587><U+2585><U+2585><U+2582>
##     1 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581>
##   731 <U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587><U+2587>
##    12 <U+2587><U+2585><U+2587><U+2583><U+2585><U+2587><U+2585><U+2587>
##  6946 <U+2581><U+2585><U+2585><U+2586><U+2587><U+2585><U+2583><U+2583>
##     4 <U+2587><U+2581><U+2587><U+2581><U+2581><U+2587><U+2581><U+2587>
##     3 <U+2587><U+2581><U+2581><U+2585><U+2581><U+2581><U+2581><U+2581>
##     6 <U+2587><U+2587><U+2587><U+2587><U+2581><U+2587><U+2587><U+2587>
##     1 <U+2583><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
##     1 <U+2587><U+2581><U+2581><U+2581><U+2581><U+2581><U+2581><U+2587>
## 
## -- Variable type:numeric --------------------------------------------------------------------------------------------------
##   variable missing complete   n mean    sd    p0  p25  p50  p75 p100
##      atemp       0      731 731 0.47 0.16  0.079 0.34 0.49 0.61 0.84
##        hum       0      731 731 0.63 0.14  0     0.52 0.63 0.73 0.97
##       temp       0      731 731 0.5  0.18  0.059 0.34 0.5  0.66 0.86
##  windspeed       0      731 731 0.19 0.077 0.022 0.13 0.18 0.23 0.51
##      hist
##  <U+2581><U+2585><U+2587><U+2587><U+2587><U+2587><U+2586><U+2581>
##  <U+2581><U+2581><U+2581><U+2583><U+2587><U+2587><U+2585><U+2582>
##  <U+2581><U+2583><U+2587><U+2586><U+2586><U+2587><U+2587><U+2582>
##  <U+2582><U+2586><U+2587><U+2586><U+2582><U+2581><U+2581><U+2581>
##  [1] "instant"    "dteday"     "season"     "yr"         "mnth"      
##  [6] "holiday"    "weekday"    "workingday" "weathersit" "temp"      
## [11] "atemp"      "hum"        "windspeed"  "casual"     "registered"
## [16] "cnt"
## 'data.frame':    731 obs. of  13 variables:
##  $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
##     dteday     season         yr       mnth    holiday    weekday 
##          0          0          0          0          0          0 
## workingday weathersit       temp      atemp        hum  windspeed 
##          0          0          0          0          0          0 
##        cnt 
##          0

3 EDA

Data is explored with simple visualisation of variables by categories: Date-Time and Weather Conditions.

  1. Date-Time Variables
    Image Credit: Designed by Freepik

  1. By Weather Variables

4 Model Fitting

Machine Learning algorithms are used to build models, namely linear regression and random forest. Data is processed according the requirements of each algorithm. The model is cross-validated by splitting the data into train, validation and test sets to carry out evaluation via regression metrics (R squared and RMSE).

4.1 Linear Regression

Base model is created with all inputs using linear regression. It will form the baseline for subsequent comparison of models. Linear regression used to build 2 models with different inputs, so as to focus on different aspects of modelling.

## [1] 731  12
## [1] 511  12
## [1] 110  12
## [1] 110  12
## 
## Call:
## lm(formula = count ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3764.5  -345.5    58.2   454.5  2881.0 
## 
## Coefficients: (1 not defined because of singularities)
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2417.73     274.95   8.793  < 2e-16 ***
## season2            875.73     211.70   4.137 4.16e-05 ***
## season3            622.70     252.31   2.468 0.013931 *  
## season4           1433.96     220.08   6.516 1.82e-10 ***
## year1             2065.95      70.63  29.250  < 2e-16 ***
## month2              39.75     170.92   0.233 0.816182    
## month3             568.22     200.17   2.839 0.004721 ** 
## month4             359.42     296.90   1.211 0.226648    
## month5             703.67     317.54   2.216 0.027155 *  
## month6             442.64     333.99   1.325 0.185692    
## month7              38.57     372.00   0.104 0.917453    
## month8             418.34     361.69   1.157 0.247999    
## month9            1108.92     310.11   3.576 0.000384 ***
## month10            528.49     285.96   1.848 0.065196 .  
## month11            -87.42     276.98  -0.316 0.752429    
## month12            -61.14     221.87  -0.276 0.783015    
## holiday1          -674.70     224.47  -3.006 0.002788 ** 
## weekdayTuesday      42.10     129.34   0.325 0.744959    
## weekdayWednesday   138.93     128.64   1.080 0.280671    
## weekdayThursday     94.36     130.66   0.722 0.470561    
## weekdayFriday      221.46     134.83   1.643 0.101136    
## weekdaySaturday    148.56     133.33   1.114 0.265731    
## weekdaySunday     -156.60     128.89  -1.215 0.224967    
## workingday1            NA         NA      NA       NA    
## weathersit        -567.52      88.37  -6.422 3.22e-10 ***
## temp              3159.17    1509.23   2.093 0.036849 *  
## atemp             1692.18    1564.79   1.081 0.280054    
## humid            -1647.26     368.26  -4.473 9.62e-06 ***
## windspd          -3110.50     494.90  -6.285 7.33e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 772.9 on 483 degrees of freedom
## Multiple R-squared:  0.8517, Adjusted R-squared:  0.8434 
## F-statistic: 102.7 on 27 and 483 DF,  p-value: < 2.2e-16
## Warning in predict.lm(fit_base, valid): prediction from a rank-deficient
## fit may be misleading
##     pred_base valid.count
## 245 4658.6684        4727
## 195 4298.3181        5084
## 580 7136.4956        7261
## 311 3695.1139        4035
## 695 5079.0197        2424
## 611 6680.9761        5810
## 723 3651.1547        1787
## 204 5471.9194        3285
## 615 6706.4789        6203
## 199 4317.4651        4458
## 319 3449.0964        4195
## 418 4344.9917        4773
## 429 4164.8387        3423
## 113 2701.0091        4036
## 41  1282.5796        1538
## 312 3822.9247        4205
## 697 4100.3736        3959
## 317 3530.1051        3717
## 517 7387.1097        7338
## 203 5465.8139        3387
## 708 4918.4981        5582
## 601 7017.2299        7765
## 702 4527.4067        4649
## 585 6414.0011        7273
## 57  1942.0977        1969
## 315 2572.9160        3368
## 154 4971.5249        5312
## 668 4313.3786          22
## 384 3428.0297        3292
## 149 4409.2423        4788
## 294 4467.0081        4304
## 151 5752.5081        3982
## 706 5246.8802        5375
## 715 5632.0520        5047
## 284 4224.3734        4563
## 479 3902.6284        3214
## 452 5897.8566        5102
## 242 4666.8476        5204
## 70  1793.7724        1977
## 29  1452.8368        1098
## 549 6890.9933        6227
## 43  1606.6884        1472
## 505 7486.1834        8294
## 455 5432.3238        5459
## 678 5416.9158        5315
## 134 3563.8571        3409
## 275 2805.0003        2918
## 152 4614.6465        3974
## 96  3521.0655        2808
## 681 5666.8000        6852
## 10  1092.0868        1321
## 602 6687.9923        7582
## 217 4855.7360        4866
## 103 2537.1257        2162
## 493 5706.4107        6273
## 266 5009.3035        2395
## 414 4326.6253        4318
## 365 2234.4421        2485
## 511 7259.3364        6734
## 514 6496.0449        6043
## 500 5707.6101        2843
## 525 7202.2055        7736
## 484 5350.3622        4220
## 369 2771.6942        2368
## 63  1693.6487        1944
## 575 6794.5243        6685
## 109 3397.6963        3204
## 391 3628.8853        4075
## 127 4697.7399        4714
## 462 5752.7083        6460
## 6   1852.4679        1606
## 279 4838.6921        4765
## 60  2195.0317        1851
## 648 5659.0598        6392
## 451 5520.5617        5558
## 179 4987.1026        4648
## 410 3726.7161        3922
## 282 4855.5788        5511
## 190 4574.9023        5336
## 17   -26.2820        1000
## 216 3925.0026        4576
## 207 4657.0306        4590
## 398 3728.2081        3761
## 407 1928.4787        2169
## 613 7166.4951        6864
## 378 3413.6381        3214
## 141 5041.9030        5805
## 592 6782.9130        6784
## 720 4913.3419        4128
## 310 3257.3918        3649
## 196 4342.4285        5538
## 582 7072.1783        6824
## 33   613.8389        1526
## 180 4828.6704        5225
## 139 3794.5784        4575
## 290 4864.1825        4570
## 75  1853.6084        2192
## 485 5873.6381        6304
## 280 5304.8178        4985
## 286 4216.4535        2913
## 527 7193.6850        6598
## 540 7106.5521        7458
## 231 4156.5917        4153
## 252 4402.9273        3544
## 411 4350.3431        4169
## 674 5258.1421        5107
## 629 6983.6091        7720
## 51  1737.5445        1812
## 259 4015.5856        4760
## 234 4521.3410        4758
## Start:  AIC=6823.63
## count ~ season + year + month + holiday + weekday + workingday + 
##     weathersit + temp + atemp + humid + windspd
## 
## 
## Step:  AIC=6823.63
## count ~ season + year + month + holiday + weekday + weathersit + 
##     temp + atemp + humid + windspd
## 
##              Df Sum of Sq       RSS    AIC
## - weekday     6   6356379 294873883 6822.8
## - atemp       1    698564 289216068 6822.9
## <none>                    288517504 6823.6
## - temp        1   2617350 291134854 6826.2
## - holiday     1   5396496 293914000 6831.1
## - humid       1  11952136 300469640 6842.4
## - windspd     1  23596949 312114454 6861.8
## - weathersit  1  24635902 313153407 6863.5
## - season      3  33491030 322008534 6873.8
## - month      11  46586296 335103800 6878.1
## - year        1 511058147 799575652 7342.5
## 
## Step:  AIC=6822.77
## count ~ season + year + month + holiday + weathersit + temp + 
##     atemp + humid + windspd
## 
##              Df Sum of Sq       RSS    AIC
## - atemp       1    391331 295265215 6821.4
## <none>                    294873883 6822.8
## - temp        1   3533858 298407742 6826.9
## - holiday     1   6821362 301695245 6832.5
## - humid       1  13858639 308732522 6844.2
## - weathersit  1  22455278 317329161 6858.3
## - windspd     1  23938053 318811936 6860.7
## - season      3  32383455 327257339 6870.0
## - month      11  45693105 340566989 6874.4
## - year        1 517941045 812814928 7338.9
## 
## Step:  AIC=6821.44
## count ~ season + year + month + holiday + weathersit + temp + 
##     humid + windspd
## 
##              Df Sum of Sq       RSS    AIC
## <none>                    295265215 6821.4
## - holiday     1   7013037 302278252 6831.4
## - humid       1  13615439 308880654 6842.5
## - weathersit  1  22937032 318202247 6857.7
## - windspd     1  25860063 321125278 6862.3
## - season      3  32536878 327802093 6868.9
## - month      11  45446975 340712190 6872.6
## - temp        1  58199721 353464936 6911.4
## - year        1 517570698 812835913 7336.9
## lm(formula = count ~ season + year + month + holiday + weathersit + 
##     temp + humid + windspd, data = train)
##     pred_log2 valid.count
## 245 5101.0470        4727
## 195 5686.9942        5084
## 580 6118.0724        7261
## 311 4916.8266        4035
## 695 4207.6974        2424
## 611 4869.7204        5810
## 723 2751.4358        1787
## 204 7351.6995        3285
## 615 4860.8709        6203
## 199 5691.4309        4458
## 319 5056.3125        4195
## 418 3628.2591        4773
## 429 2602.3634        3423
## 113 3482.0761        4036
## 41  2077.6811        1538
## 312 5207.8404        4205
## 697 3032.9026        3959
## 317 4966.0968        3717
## 517 6310.6621        7338
## 203 7634.9503        3387
## 708 4030.3045        5582
## 601 5855.4228        7765
## 702 3628.5019        4649
## 585 5608.6594        7273
## 57  2988.1085        1969
## 315 4585.6939        3368
## 154 6518.6755        5312
## 668 2863.2262          22
## 384 2206.8559        3292
## 149 4996.3244        4788
## 294 5419.6967        4304
## 151 6958.9713        3982
## 706 4240.3491        5375
## 715 4789.8449        5047
## 284 5215.2373        4563
## 479 2272.2913        3214
## 452 4709.6206        5102
## 242 5696.8779        5204
## 70  2494.2979        1977
## 29  2332.5656        1098
## 549 6319.4308        6227
## 43  2556.3767        1472
## 505 6486.3920        8294
## 455 4403.7303        5459
## 678 4659.5450        5315
## 134 4111.3374        3409
## 275 3244.7327        2918
## 152 6200.6768        3974
## 96  4606.3944        2808
## 681 4937.2431        6852
## 10  1859.0577        1321
## 602 5969.6412        7582
## 217 5987.1976        4866
## 103 3481.3361        2162
## 493 4441.1674        6273
## 266 5144.6691        2395
## 414 3501.9325        4318
## 365 3584.1812        2485
## 511 6151.6494        6734
## 514 6044.1849        6043
## 500 4325.9660        2843
## 525 6577.6801        7736
## 484 4448.0175        4220
## 369 1872.5322        2368
## 63  2364.1726        1944
## 575 6180.1587        6685
## 109 4597.1009        3204
## 391 2774.5991        4075
## 127 5573.9838        4714
## 462 4877.6032        6460
## 6   2954.7653        1606
## 279 5664.6255        4765
## 60  2698.5193        1851
## 648 4396.3379        6392
## 451 4153.5475        5558
## 179 6083.1296        4648
## 410 3133.5605        3922
## 282 5720.8821        5511
## 190 5913.3252        5336
## 17  1581.2976        1000
## 216 4865.4916        4576
## 207 6158.6192        4590
## 398 2912.8767        3761
## 407  707.7338        2169
## 613 5358.5883        6864
## 378 2359.7311        3214
## 141 5912.5872        5805
## 592 5731.1775        6784
## 720 4087.7696        4128
## 310 4441.9067        3649
## 196 5874.3195        5538
## 582 6180.4389        6824
## 33  1535.2159        1526
## 180 5913.9352        5225
## 139 4432.7771        4575
## 290 5742.4267        4570
## 75  2566.2879        2192
## 485 4827.7845        6304
## 280 6450.1790        4985
## 286 4962.8655        2913
## 527 6417.0133        6598
## 540 5993.1204        7458
## 231 5317.1377        4153
## 252 4505.0647        3544
## 411 3638.8774        4169
## 674 4432.6388        5107
## 629 5029.6588        7720
## 51  2831.5921        1812
## 259 4365.3394        4760
## 234 5491.5700        4758

4.2 Random Forest

Random forest trained with all inputs as results are not affected by multi-collinearity.

##                 Length Class      Mode     
## call              4    -none-     call     
## type              1    -none-     character
## predicted       511    -none-     numeric  
## mse             500    -none-     numeric  
## rsq             500    -none-     numeric  
## oob.times       511    -none-     numeric  
## importance       28    -none-     numeric  
## importanceSD      0    -none-     NULL     
## localImportance   0    -none-     NULL     
## proximity         0    -none-     NULL     
## ntree             1    -none-     numeric  
## mtry              1    -none-     numeric  
## forest           11    -none-     list     
## coefs             0    -none-     NULL     
## y               511    -none-     numeric  
## test              0    -none-     NULL     
## inbag             0    -none-     NULL     
## xNames           28    -none-     character
## problemType       1    -none-     character
## tuneValue         1    data.frame list     
## obsLevels         1    -none-     logical  
## param             0    -none-     list

5 Ensemble - Simple Average

After taking a look at all the predicted values of the 3 models generated, a combination of the 3 models seem to improve the model even further. The predicted values of all 3 models are added up, and the mean is taken.

##     combpred.avg_count testpredDF.count
## 1             2074.882             1600
## 2             1698.836             1927
## 3             1815.024             1096
## 4             3919.159             2927
## 5             1192.803             1107
## 6             2073.535             1685
## 7             2859.824             2133
## 8             2391.030              623
## 9             2872.640             2046
## 10            2399.426             1693
## 11            2717.156             1536
## 12            2625.201             1795
## 13            2502.175             2455
## 14            3667.243             3126
## 15            4744.281             3944
## 16            4565.173             4608
## 17            5023.174             4182
## 18            4520.635             4864
## 19            4145.239             4274
## 20            5011.919             4758
## 21            5411.871             4548
## 22            5700.599             4401
## 23            4950.781             5020
## 24            4280.183             4835
## 25            5278.944             5202
## 26            4715.699             4708
## 27            5534.900             5362
## 28            4247.892             4040
## 29            4787.763             3574
## 30            3861.135             3820
## 31            4676.038             4338
## 32            5138.503             5895
## 33            5163.571             4661
## 34            5106.529             4484
## 35            2566.307             2710
## 36            4935.359             5345
## 37            5215.693             4785
## 38            3255.182             2416
## 39            4528.447             3644
## 40            4606.360             4308
## 41            4189.373             4186
## 42            3838.426             3926
## 43            3075.469             3053
## 44            4260.287             2792
## 45            4079.546             3071
## 46            3625.516             3727
## 47            3959.692             3620
## 48            3652.589             3740
## 49            3532.465             3577
## 50            2712.614             3068
## 51            1711.346             1162
## 52            2088.254             2302
## 53            3663.788             3425
## 54            2608.211             2177
## 55            2738.140             2493
## 56            2425.748             2311
## 57            2751.424             2802
## 58            1894.750             1529
## 59            2666.665             2689
## 60            3058.675             3129
## 61            3860.078             4322
## 62            4027.260             4066
## 63            4071.945             4916
## 64            3935.190             4911
## 65            5049.974             5298
## 66            5947.711             6398
## 67            6212.993             6565
## 68            5927.017             6624
## 69            5179.156             5633
## 70            5165.953             5026
## 71            5500.520             6233
## 72            5084.455             5572
## 73            6032.964             6169
## 74            6719.505             7429
## 75            6409.833             6118
## 76            6018.747             5260
## 77            6401.092             6591
## 78            5783.529             4127
## 79            5626.545             7001
## 80            6762.934             6825
## 81            6852.766             6879
## 82            6287.605             6031
## 83            6620.359             6591
## 84            4148.225             4459
## 85            6457.985             7592
## 86            6743.655             8173
## 87            6225.134             5464
## 88            6759.493             7605
## 89            5672.157             6530
## 90            5493.805             6053
## 91            6809.111             7713
## 92            6719.588             6140
## 93            6520.428             7525
## 94            6879.013             8009
## 95            4776.544             4073
## 96            6662.653             8167
## 97            7081.165             8555
## 98            7239.723             8156
## 99            6444.839             7965
## 100           6270.561             7570
## 101           7111.833             7693
## 102           5995.340             7444
## 103           5883.374             7852
## 104           4622.118             5138
## 105           4807.794             5499
## 106           5666.270             3910
## 107           5589.277             5729
## 108           2545.245             1013
## 109           1990.668             2114
## 110           2416.977             1796

6 Simple Diagnostic Checks

Residuals vs Fitted: shows a random pattern
Normal Q-Q: the residuals fairly follow the straight line and are normally distributed, so we can assume normality
Scale-Location: The residuals look randomly spread out along the line.
Residuals vs Leverage: All cases are within boundaries of Cook’s distance, hence no outliers which are influential to the regression results.

7 Conclusion

By accurately predicting the number of bike rentals per day, companies can anticipate the bike volumes required and allocate resources efficiently to manage their bike fleet according to user demand.